Tight Policy Regret Bounds for Improving and Decaying Bandits
Authors
Abstract
We consider a variant of the well-studied multiarmed bandit problem in which the reward from each action evolves monotonically in the number of times the decision maker chooses to take that action. We are motivated by settings in which we must give a series of homogeneous tasks to a finite set of arms (workers) whose performance may improve (due to learning) or decay (due to loss of interest) with repeated trials. We assume that the arm-dependent rates at which the rewards change are unknown to the decision maker, and propose algorithms with provably optimal policy regret bounds, a much stronger notion than the often-studied external regret. For the case where the rewards are increasing and concave, we give an algorithm whose policy regret is sublinear and has a (provably necessary) dependence on the time required to distinguish the optimal arm from the rest. We illustrate the behavior and performance of this algorithm via simulations. For the decreasing case, we present a simple greedy approach and show that the policy regret of this algorithm is constant and upper bounded by the number of arms.
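To make the decaying case concrete, the sketch below simulates a greedy rule in this spirit: pull every arm once, then always pull the arm whose most recent reward was largest; because a decaying arm's next reward can only be lower than its last, the last observation is an optimistic index. This is a minimal illustration under assumed arm models (the function names and reward curves are ours), not a reproduction of the authors' exact procedure or analysis.

```python
import heapq

def greedy_decaying(arms, horizon):
    """Greedy play for decaying bandits: after one pull of each arm,
    repeatedly pull the arm whose most recent reward was largest.

    `arms` maps an arm id to a function giving the reward of that arm's
    k-th pull (k = 1, 2, ...); rewards are assumed non-increasing in k,
    and `horizon` is assumed to be at least the number of arms.
    """
    pulls = {a: 0 for a in arms}
    heap, history, total = [], [], 0.0

    def pull(a):
        nonlocal total
        pulls[a] += 1
        r = arms[a](pulls[a])
        total += r
        history.append(a)
        # Negate for a max-heap keyed on the last observed reward,
        # which upper-bounds the arm's next (decaying) reward.
        heapq.heappush(heap, (-r, a))

    for a in arms:  # one exploratory pull per arm
        pull(a)
    while len(history) < horizon:
        _, a = heapq.heappop(heap)
        pull(a)
    return history, total

# Hypothetical decaying arms: reward of the k-th pull of each arm.
arms = {0: lambda k: 1.0 / k, 1: lambda k: 0.8 * 0.95 ** k}
print(greedy_decaying(arms, horizon=20))
```

Intuitively, the greedy rule pays for a wrong choice only until that arm's observed reward decays below a better arm's, which matches the flavor of the constant, arm-bounded policy regret claimed above.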
Similar Resources
Open Problem: First-Order Regret Bounds for Contextual Bandits
We describe two open problems related to first-order regret bounds for contextual bandits. The first asks for an algorithm with a regret bound of Õ(√(L⋆K ln N)) where there are K actions, N policies, and L⋆ is the cumulative loss of the best policy. The second asks for an optimization-oracle-efficient algorithm with regret Õ(L⋆^{2/3} poly(K, ln(N/δ))). We describe some positive results, such as an ineff...
Batched Bandit Problems
Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy that operates under this constraint and show that a very small number of batches gives close to minimax optimal regret bounds. As a byproduct, we derive optima...
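For illustration only (this is generic batched successive elimination over simulated Bernoulli arms, not the specific policy or batch grid proposed in the paper), the sketch below shows what the batching constraint means operationally: arms are compared only at batch boundaries, using Hoeffding confidence radii. All parameters are assumptions for the demo.

```python
import math
import random

def batched_elimination(means, horizon, num_batches, delta=0.05):
    """Successive elimination constrained to `num_batches` batches:
    within a batch, pull every surviving arm equally; only at batch
    boundaries, drop arms whose upper confidence bound falls below the
    leader's lower confidence bound. `means` are true Bernoulli
    parameters, used here only to simulate draws."""
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    active = list(range(k))
    per_batch = horizon // num_batches  # assumed >= number of arms

    for _ in range(num_batches):
        for arm in active:
            for _ in range(max(1, per_batch // len(active))):
                sums[arm] += float(random.random() < means[arm])
                counts[arm] += 1
        # Hoeffding confidence radius for each arm's empirical mean.
        def radius(arm):
            return math.sqrt(math.log(2 * k * horizon / delta) / (2 * counts[arm]))
        leader = max(active, key=lambda a: sums[a] / counts[a])
        floor = sums[leader] / counts[leader] - radius(leader)
        active = [a for a in active if sums[a] / counts[a] + radius(a) >= floor]
    return leader

print(batched_elimination([0.3, 0.5, 0.7], horizon=3000, num_batches=4))
```

The defining restriction is that feedback gathered inside a batch cannot change pulls within that same batch, which is what makes the choice of batch sizes delicate.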
Online Regret Bounds for Markov Decision Processes with Deterministic Transitions
We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an (ε-)optimal policy) that are logarithmic in the number of steps taken. These bounds also match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an...
Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits
A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we close the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits. In particular, we an...
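As a concrete (and deliberately simple) instance of the protocol, the sketch below plays the special case where the feasible sets are all subsets of at most m ground items, using a UCB-style index in the spirit of CombUCB1; the Bernoulli weight model and constants are our assumptions, not the paper's setup.

```python
import math
import random

def semi_bandit_top_m(means, m, horizon):
    """Stochastic combinatorial semi-bandit, top-m special case: each
    step, choose the m items with the largest UCB indices, observe the
    weight of every chosen item (semi-bandit feedback), and collect
    their sum. `means` are true Bernoulli weights used to simulate."""
    k = len(means)
    counts, sums, payoff = [0] * k, [0.0] * k, 0.0
    for t in range(1, horizon + 1):
        def ucb(i):
            if counts[i] == 0:
                return float("inf")  # ensure each item is tried once
            return sums[i] / counts[i] + math.sqrt(1.5 * math.log(t) / counts[i])
        chosen = sorted(range(k), key=ucb, reverse=True)[:m]
        for i in chosen:
            w = float(random.random() < means[i])  # observed item weight
            sums[i] += w
            counts[i] += 1
            payoff += w
    return payoff

print(semi_bandit_top_m([0.2, 0.4, 0.6, 0.8], m=2, horizon=2000))
```

For general constraint families the selection step becomes a combinatorial optimization over feasible sets; the top-m case keeps it a simple sort.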
Multiarmed Bandits in the Worst Case
We present a survey of results on a recently formulated variant of the classical (stochastic) multiarmed bandit problem in which no assumption is made on the mechanism generating the rewards. We describe randomized allocation policies for this variant and prove bounds on their regret as a function of the time horizon and the number of arms. These bounds hold for any assignment of rewards to the...
Journal:
Volume/Issue:
Pages: -
Publication date: 2016